A Hybrid Morphologically Decomposed Factored Language Models for Arabic LVCSR
Abstract
In this work, we try a hybrid methodology for language modeling in which both morphological decomposition and factored language modeling (FLM) are exploited to deal with the complex morphology of the Arabic language. In the end, we obtain a 3.5% to 7.0% relative reduction in word error rate (WER) with respect to a traditional full-words system, and a 1.0% to 2.0% relative WER reduction with respect to a non-factored decomposed system.
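To make the combination concrete, below is a minimal, self-contained Python sketch of the two ingredients named in the abstract: a rule-based decomposition of transliterated Arabic words into prefix/stem/suffix morphemes, and a small factored bigram language model that backs off from the full parent set {previous morpheme, previous morph type} to {previous morph type} to a unigram. The affix inventory, the "type" factor, the interpolation weights, and all names are illustrative assumptions, not the configuration reported in the paper.

from collections import defaultdict

# Hypothetical, very small affix inventory (Buckwalter-style transliteration).
PREFIXES = ("wAl", "Al", "w", "b", "l")
SUFFIXES = ("At", "wn", "h", "p")

def decompose(word):
    """Split a word into (morpheme, type) pairs: prefix+ / stem / +suffix."""
    morphs = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            morphs.append((p + "+", "PRE"))
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            suffix = ("+" + s, "SUF")
            word = word[:-len(s)]
            break
    morphs.append((word, "STEM"))
    if suffix:
        morphs.append(suffix)
    return morphs

class FactoredBigramLM:
    """Bigram FLM over morphemes: back off from {prev morpheme, prev type}
    to {prev type} to the unigram distribution."""

    def __init__(self):
        self.c_full = defaultdict(lambda: defaultdict(int))  # (prev_m, prev_t) -> m -> count
        self.c_type = defaultdict(lambda: defaultdict(int))  # prev_t -> m -> count
        self.c_uni = defaultdict(int)
        self.total = 0

    def train(self, sentences):
        for sent in sentences:
            stream = [("<s>", "BOS")] + [m for w in sent for m in decompose(w)]
            for (pm, pt), (m, _) in zip(stream, stream[1:]):
                self.c_full[(pm, pt)][m] += 1
                self.c_type[pt][m] += 1
                self.c_uni[m] += 1
                self.total += 1

    def prob(self, m, prev_m, prev_t, lam=(0.6, 0.3, 0.1)):
        """Linear interpolation of the three backoff levels (weights assumed)."""
        full = self.c_full[(prev_m, prev_t)]
        typ = self.c_type[prev_t]
        p_full = full[m] / sum(full.values()) if full else 0.0
        p_type = typ[m] / sum(typ.values()) if typ else 0.0
        p_uni = self.c_uni[m] / self.total if self.total else 0.0
        return lam[0] * p_full + lam[1] * p_type + lam[2] * p_uni

# Toy usage with made-up transliterated words.
lm = FactoredBigramLM()
lm.train([["Alktb", "Aljdydp"], ["wAlktb", "Alqdymp"]])
print(decompose("wAlktAb"))          # [('wAl+', 'PRE'), ('ktAb', 'STEM')]
print(lm.prob("ktb", "Al+", "PRE"))  # probability of stem 'ktb' after prefix 'Al+'

The actual FLMs in such work use generalized backoff over richer factor sets (e.g., morpheme, stem, part-of-speech); the fixed-weight interpolation above only stands in for that mechanism.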
Similar Articles
Morpheme level hierarchical Pitman-Yor class-based language models for LVCSR of morphologically rich languages
Performing large vocabulary continuous speech recognition (LVCSR) for morphologically rich languages is considered a challenging task. The morphological richness of such languages leads to high out-of-vocabulary (OOV) rates and poor language model (LM) probabilities. In this case, the use of morphemes has been shown to increase the lexical coverage and lower the LM perplexity. Another approach ...
Improvements in RWTH LVCSR evaluation systems for Polish, Portuguese, English, Urdu, and Arabic
In this work, Portuguese, Polish, English, Urdu, and Arabic automatic speech recognition evaluation systems developed by RWTH Aachen University are presented. Our LVCSR systems focus on various domains like broadcast news, spontaneous speech, and podcasts. All these systems except Urdu are used for Euronews and Skynews evaluations as part of the EUBridge project. Our previously developed LVCSR...
Sub-word based language modeling of morphologically rich languages for LVCSR
Speech recognition is the task of decoding an acoustic speech signal into a written text. Large vocabulary continuous speech recognition (LVCSR) systems are able to deal with a large vocabulary of words, typically more than 100k words, pronounced continuously in a fluent manner. Although most of the techniques used in speech recognition are language independent, still different languages are po...
Morpheme Based Factored Language Models for German LVCSR
German is a highly inflectional language in which a large number of words can be generated from the same root. It also makes liberal use of compounding, leading to high out-of-vocabulary (OOV) rates and poor language model (LM) probability estimates. Therefore, the use of morphemes for language modeling is considered a better choice for Large Vocabulary Continuous Speech Recognition (LVCSR) than the...
Factored recurrent neural network language model in TED lecture transcription
In this study, we extend recurrent neural network-based language models (RNNLMs) by explicitly integrating morphological and syntactic factors (or features). Our proposed model, called a factored RNNLM, is expected to enhance standard RNNLMs. A number of experiments carried out on top of a state-of-the-art LVCSR system show that the factored RNNLM improves the performance measured by perplexity ...